Saltwater intrusion and sea level rise (SWISLR) are complex and intertwined processes that threaten coastal infrastructure and erode biodiverse shoreline habitats. In simpler terms, SWISLR threatens the water quality of coastal communities by flooding sewage infrastructure, leaching toxins into groundwater sources, and contaminating well water.
Research can help scientists better understand how these processes work and can inspire new solutions, or enhance existing ones, to equip frontline communities and stakeholders with the support and tools to tackle climate change, especially in the most vulnerable communities.
A North Carolina wetland.
Having every scientist conduct their own fieldwork is invaluable and rewarding, but it is also expensive. That makes it imperative that scientists share their data in accessible ways, so that everyone can both contribute to and benefit from open-access data.
Two of the largest repositories of water quality data are the United States Geological Survey’s (USGS) National Water Information System (NWIS) and the Environmental Protection Agency’s (EPA) STOrage and RETrieval (STORET) database. Both contain hundreds of thousands of data points that can enhance current discourse on solutions to SWISLR.
The goal of this document is to understand the applications and limitations of the USGS NWIS repository and to evaluate areas of improvement for currently available data. This file contains the code used to analyze the data, along with visualizations that describe it.
The key goals are:
# First install the remotes package
install.packages("remotes")
# Install HASP package
remotes::install_gitlab("water/stats/hasp",
host = "code.usgs.gov",
build_opts = c("--no-resave-data","--no-manual"),
build_vignettes = TRUE,
dependencies = TRUE)
library(tidyverse)
library(remotes)
library(HASP)
library(dataRetrieval)
library(dplyr)
library(ggplot2)
library(sf)
library(olsrr)
library(maps)
library(DataExplorer)
library(nycflights13)
library(knitr)
library(lubridate)
library(dygraphs)
library(xts)
library(zoo)
library(stats)
| PCode | Type | Description | Sample Type | Temperature | Units |
|---|---|---|---|---|---|
| 00094 | Physical | Specific conductance, water, unfiltered, field, microsiemens per centimeter at 25 degrees Celsius | Total | 25 deg C | uS/cm |
| 00095 | Physical | Specific conductance, water, unfiltered, microsiemens per centimeter at 25 degrees Celsius | Total | 25 deg C | uS/cm |
| 00402 | Physical | Specific conductance, non-temperature corrected, water, unfiltered, microsiemens per centimeter | Total | | uS/cm |
| 70386 | Physical | Salinity, water, computed by regression equation from specific conductance, milligrams per liter | Total | | mg/L |
| 72430 | Physical | Specific conductance, water, filtered, field, microsiemens per centimeter at 25 degrees Celsius | Dissolved | 25 deg C | uS/cm |
# Parameter codes for specific conductance and salinity measurements
pCode <- c("00094", "00095", "00402", "70386", "72430", "90094", "90095",
"90096", "99974", "99978", "99982")
# Pulling USGS repository data that is in North Carolina and contains the parameters we want
scNC <- whatNWISdata(stateCd = "NC",
parameterCd = pCode)
# Creating new columns with the start and end year only
scNC <- scNC %>%
separate(begin_date, into = c("Year_start"), sep = "-", remove = FALSE) %>%
separate(end_date, into = c("Year_end"), sep = "-", remove = FALSE)
# Adding a "period" column and filtering for sites with records ending after 2000
scNC_1 <- scNC %>%
mutate(period = as.Date(end_date) - as.Date(begin_date)) %>%
filter(as.numeric(Year_end) > 2000)
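Because `begin_date` and `end_date` parse as Date objects, the year can also be derived numerically instead of splitting the date string, which keeps later filters comparing numbers rather than text. A minimal base-R sketch (standalone example values, not the NWIS columns):

```r
# Derive numeric years from Date values instead of splitting the "YYYY-MM-DD"
# string; numeric comparison avoids the pitfalls of comparing text.
dates <- as.Date(c("2006-03-01", "1998-07-15"))
years <- as.integer(format(dates, "%Y"))
years          # 2006 1998
years > 2000   # TRUE FALSE
```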
| Station Name | Site Number | Parameter Code | Start Year | End Year |
|---|---|---|---|---|
| NORTHWEST RIVER ABOVE MOUTH NEAR MOYOCK, NC | 02043410 | 00095 | 2006 | 2007 |
| NORTHWEST RIVER ABOVE MOUTH NEAR MOYOCK, NC | 02043410 | 00095 | 2006 | 2007 |
| NORTHWEST RIVER ABOVE MOUTH NEAR MOYOCK, NC | 02043410 | 00095 | 2006 | 2007 |
| NORTHWEST RIVER ABOVE MOUTH NEAR MOYOCK, NC | 02043410 | 00095 | 2006 | 2007 |
| TULL CREEK AT SR 1222 NEAR CURRITUCK, NC | 02043415 | 00095 | 2006 | 2007 |
At this point, the only two parameters remaining in our data are “00095” and “90095”.
# Check parameters
unique(scNC_1$parm_cd)
## [1] "00095" "90095"
# Get specific conductance measurements
sc_sites <- scNC_1$site_no
sc_data <- readNWISqw(siteNumbers = sc_sites, parameterCd = pCode)
# Tidy data
sc_data <- sc_data %>%
separate(sample_dt, into = c("Year_start"), sep = "-", remove = FALSE) %>%
filter(as.numeric(Year_start) > 2000)
# View the data structure
plot_str(sc_data)
# Let's take a look at some information about our data frame
data_structure_1 <- introduce(sc_data)
| Category | Data |
|---|---|
| Rows | 14903 |
| Columns | 38 |
| Discrete Columns | 32 |
| Continuous Columns | 5 |
| All Missing Columns | 1 |
| Total Missing Values | 244882 |
| Complete Rows | 0 |
| Total Observations | 566314 |
| Memory Usage | 5310640 |
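From the counts reported above, the overall share of missing cells can be computed directly (numbers copied from the `introduce()` output):

```r
# Share of missing cells, using the counts reported by introduce() above
total_missing <- 244882
total_obs <- 566314
pct_missing <- round(100 * total_missing / total_obs, 1)
pct_missing   # 43.2 -- about 43% of all cells are missing
```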
# Putting what we know in a visual format
data_structure_2 <- plot_intro(sc_data)
# Understanding the data we don't have access to
data_structure_3 <- plot_missing(sc_data) +
labs(title = "Percentage of Observations Missing")
# Create a map of all the locations where specific conductance is measured
# (usa and sf_scNC_1 are not defined above; the two lines below are an assumed
# setup that builds them from the maps package and the site coordinates)
usa <- st_as_sf(maps::map("state", fill = TRUE, plot = FALSE))
sf_scNC_1 <- st_as_sf(scNC_1, coords = c("dec_long_va", "dec_lat_va"), crs = 4326)
map_scNC <- ggplot() +
geom_sf(data = usa[ usa$ID == "north carolina" ,]) +
geom_sf(data = sf_scNC_1) +
xlab(NULL)+
ylab(NULL)+
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
labs(title = "Sites that Measure Specific Conductance of Water in North Carolina since 2000",
caption = "United States Geological Survey, 2024.")
map_scNC
# Parameter code bar graph (freq holds the number of records per parameter code)
freq <- scNC_1 %>% count(parm_cd, name = "Frequency")
ggplot(freq, aes(x = Frequency, y = factor(parm_cd))) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Records by Parameter Code", x = "Frequency", y = "Parameter") +
theme_minimal()
# Sample date graph
date_counts <- table(sc_data$sample_dt)
barplot(date_counts,
main = "Frequency of Sample Dates",
xlab = "Date",
ylab = "Frequency",
col = "skyblue")
plot_n <- ggplot(data = sc_data, aes(x = sample_dt, y = result_va)) +
geom_point() +
scale_y_log10() +
labs(title = "Specific Conductance levels in North Carolina Waters",
x = "Year",
y = "Specific Conductance in uS/cm at 25 degrees C",
caption = "USGS, 2024.") +
geom_smooth(method = "lm", formula = y ~ poly(x, 3), se = FALSE)
plot_n
# Import fixed data
new_sc_counties <- read_csv("scNC_1.csv")
## New names:
## Rows: 2076 Columns: 29
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (15): agency_cd, site_no, station_nm, site_tp_cd, coord_acy_cd, dec_coo... dbl
## (12): ...1, dec_lat_va, dec_long_va, alt_va, alt_acy_va, ts_id, srs_id,... date
## (2): begin_date, end_date
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
# Add new column
scNC_1$counties <- new_sc_counties$county
# Filtering by CAMA counties
counties_of_interest <- c("beaufort", "bertie", "brunswick",
"camden", "carteret", "chowan", "craven", "currituck",
"dare", "gates", "hertford", "hyde", "new hanover",
"onslow", "pamlico", "pasquotank", "pender", "perquimans",
"tyrrell", "washington")
scNC_coastal_counties <- subset(scNC_1, counties %in% counties_of_interest)
# Filter by year
scNC_coastal_counties <- scNC_coastal_counties %>%
filter(as.numeric(Year_start) > 2000)
# Pull coastal measurements
sc_coastal_data <- readNWISqw(siteNumbers = scNC_coastal_counties$site_no,
parameterCd = pCode)
# Filter again by year
sc_coastal_data <- sc_coastal_data %>%
separate(sample_dt, into = c("Year_start"), sep = "-", remove = FALSE) %>%
filter(as.numeric(Year_start) > 2000)
| Agency | Site Number | Sample Date | Sample Time | Parameter Code | Result |
|---|---|---|---|---|---|
| USGS | 02043750 | 2012-08-15 | 14:38 | 00095 | 101 |
| USGS | 362925076281300 | 2012-08-15 | 13:20 | 00095 | 88 |
| USGS | 362609076295100 | 2012-08-15 | 14:20 | 00095 | 85 |
| USGS | 362827076294900 | 2012-08-15 | 14:10 | 00095 | 93 |
| USGS | 362838076271400 | 2012-08-15 | 13:00 | 00095 | 91 |
# View the data structure
plot_str(sc_coastal_data)
# Let's take a look at some information about our data frame
introduce(sc_coastal_data)
| Category | Data |
|---|---|
| Rows | 1086 |
| Columns | 37 |
| Discrete Columns | 22 |
| Continuous Columns | 1 |
| All Missing Columns | 14 |
| Total Missing Values | 16858 |
| Complete Rows | 0 |
| Total Observations | 40182 |
| Memory Usage | 381024 |
# Putting what we know in a visual format
data_structure_2 <- plot_intro(sc_coastal_data)
# Understanding the data we don't have access to
data_structure_3 <- plot_missing(sc_coastal_data) +
labs(title = "Percentage of Observations Missing")
# Create a map of all the coastal locations where specific conductance is measured
# (sf_scNC_coast is not defined above; the line below is an assumed setup that
# builds it from the coastal site coordinates)
sf_scNC_coast <- st_as_sf(scNC_coastal_counties,
coords = c("dec_long_va", "dec_lat_va"), crs = 4326)
map_scNC <- ggplot() +
geom_sf(data = usa[ usa$ID == "north carolina" ,]) +
geom_sf(data = sf_scNC_coast) +
xlab(NULL)+
ylab(NULL)+
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
labs(title = "Coastal Sites that Measure Specific Conductance of Water in North Carolina since 2000",
caption = "United States Geological Survey, 2024.")
map_scNC
# Plot data
plot_sc_coastal <- ggplot(data = sc_coastal_data,
aes(x = sample_dt, y = result_va)) +
geom_point() +
scale_y_log10() +
geom_smooth(method = "loess") +
labs(title = "Specific Conductance in North Carolina Coastal Waters since 2000",
x = "Date",
y = "Specific Conductance in uS/cm")
plot_sc_coastal
## `geom_smooth()` using formula = 'y ~ x'
# Subset samples with parameter code 00095
sc_coastal_field <- sc_coastal_data %>%
subset(parm_cd == "00095")
# Plot data
plot_sc_field <- ggplot(data = sc_coastal_field,
aes(x = sample_dt, y = result_va)) +
geom_point() +
geom_smooth(method = "loess")
plot_sc_field
## `geom_smooth()` using formula = 'y ~ x'
# Subset samples with parameter code 90095
sc_coastal_unfiltered <- sc_coastal_data %>%
subset(parm_cd == "90095")
# Plot data
plot_sc_unfiltered <- ggplot(data = sc_coastal_unfiltered,
aes(x = sample_dt, y = result_va)) +
geom_point() +
geom_smooth(method = "loess")
plot_sc_unfiltered
## `geom_smooth()` using formula = 'y ~ x'
plot_correlation(correlation_coastal_sc, type = "d")
## 4 features with more than 20 categories ignored!
## site_no: 261 categories
## sample_dt: 316 categories
## sample_tm: 195 categories
## project_cd: 21 categories
Here are the samples by their data quality indicator (DQI), a measure of whether a sample is deemed accurate by hydrologists at the USGS. When a sample is first taken, it is marked “presumed satisfactory” (S); only after review by a hydrologist is it marked “reviewed and accepted” (R).
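The two DQI codes can be recoded into readable labels before plotting; a small sketch, where the helper name and the "Other" fallback are my own (only the S and R meanings come from the USGS definitions above):

```r
# Map NWIS data-quality-indicator (DQI) codes to readable labels.
# "S" and "R" follow the USGS definitions above; "Other" is a catch-all.
dqi_labels <- c(S = "Presumed satisfactory",
                R = "Reviewed and accepted")
label_dqi <- function(codes) {
  labeled <- dqi_labels[codes]        # look up each code by name
  labeled[is.na(labeled)] <- "Other"  # fall back for any unexpected code
  unname(labeled)
}
label_dqi(c("R", "S", "R"))
```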
ggplot(freq_5, aes(x = Frequency, y = factor(dqi_cd))) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Samples by Data Quality Indicator", x = "Frequency", y = "DQI") +
theme_minimal()
Most of the samples were reviewed and accepted by a hydrologist, so any outliers are unlikely to be mistakes.
ggplot(freq_7, aes(x = Frequency, y = factor(parm_cd))) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Samples by Parameter", x = "Frequency", y = "Parameter") +
theme_minimal()
# Sample date graph
date_counts <- table(sc_coastal_data$sample_dt)
barplot(date_counts,
main = "Frequency of Sample Dates",
xlab = "Date",
ylab = "Frequency",
col = "skyblue")
# Sample time graph (time_counts is not built above; build it from sample times)
time_counts <- table(sc_coastal_data$sample_tm)
barplot(time_counts,
main = "Frequency of Samples Taken at Different Times",
xlab = "Time",
ylab = "Frequency",
col = "steelblue",
las = 2, # Rotate x-axis labels vertically for better readability
cex.names = 0.8, # Adjust label size
names.arg = names(time_counts)) # Use actual time strings for x-axis labels
ggplot(freq_3, aes(x = Frequency, y = factor(hyd_event_cd))) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Samples by Hydrological Event", x = "Frequency", y = "Hydrological Event") +
theme_minimal()
ggplot(freq_2, aes(x = Frequency, y = factor(hyd_cond_cd))) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Samples by Hydrological Condition", x = "Frequency", y = "Hydrological Condition") +
theme_minimal()
# Then you can create the xts necessary to use dygraph
don <- xts(x = sc_coastal_data$result_va, order.by = sc_coastal_data$sample_dt)
# Finally the plot
p <- dygraph(don) %>%
dyOptions(labelsUTC = TRUE, fillGraph=TRUE, fillAlpha=0.1, drawGrid = FALSE, colors="#D8AE5A") %>%
dyRangeSelector() %>%
dyCrosshair(direction = "vertical") %>%
dyHighlight(highlightCircleSize = 5, highlightSeriesBackgroundAlpha = 0.2, hideOnMouseOut = FALSE) %>%
dyRoller(rollPeriod = 1)
p
The maximum specific conductance value recorded is 53,600 uS/cm at 25 degrees Celsius. This makes sense because seawater has a specific conductance of around 50,000 uS/cm at 25 degrees Celsius.
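A quick sanity check on that maximum: a common rule of thumb approximates salinity (in parts per thousand) as roughly 0.64 times specific conductance in mS/cm. This is only a rough approximation (the USGS salinity parameter 70386 uses a fitted regression instead), but it lands near typical seawater salinity of about 35 ppt:

```r
# Rough rule of thumb: salinity (ppt) ~ 0.64 * specific conductance (mS/cm).
# Approximation only; the USGS salinity parameter uses a regression equation.
approx_salinity_ppt <- function(sc_uScm) {
  0.64 * (sc_uScm / 1000)  # convert uS/cm to mS/cm, then apply the factor
}
approx_salinity_ppt(53600)  # about 34.3 ppt, close to seawater (~35 ppt)
```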
group_a <- sc_coastal_data$result_va
group_b <- non_coastal_data$result_va
# Perform the independent sample t-test
t_test_result <- t.test(group_a, group_b)
# Print the t-test result
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: group_a and group_b
## t = -0.095602, df = 1051, p-value = 0.9239
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -159.8141 144.9648
## sample estimates:
## mean of x mean of y
## 542.3087 549.7334
With a p-value of 0.92, the t-test shows no statistically significant difference in mean specific conductance between the two groups. For context, freshwater is typically between 0 and 1,500 uS/cm, and potable water in the U.S. is between 30 and 1,500 uS/cm.
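Those thresholds suggest a simple way to bin the measurements. A sketch with illustrative cut points: the 1,500 uS/cm break comes from the freshwater bound quoted above, while the 30,000 uS/cm break is my own choice, placed below seawater's roughly 50,000 uS/cm:

```r
# Bin specific conductance (uS/cm) into broad classes. The 1,500 cut comes
# from the freshwater bound quoted above; the 30,000 cut is illustrative.
classify_sc <- function(sc_uScm) {
  cut(sc_uScm,
      breaks = c(-Inf, 1500, 30000, Inf),
      labels = c("fresh", "brackish", "saline"))
}
classify_sc(c(101, 542, 5000, 53600))  # fresh fresh brackish saline
```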
So far, the data I have seen does not match the literature I read while creating the SaltWatch Dashboard. Given the factors explored above, this could be due to differing conditions at each site and at the times when samples were taken.
Next steps: